Abstract

Almost all of the current process scheduling algorithms used in modern operating
systems (OS) have their roots in the classical scheduling paradigms developed during
the 1970s. But modern computers have different types of software loads and user
demands. We think it is important to run what the user wants at the current moment. A
user can be a human sitting in front of a desktop machine, or it can be another machine
sending a request to a server through a network connection. We think that the OS should
become intelligent enough to distinguish between different processes and allocate
resources, including CPU, to those processes which need them most. In this work, as a
first step to make the OS aware of the current state of the system, we consider process
dependencies and interprocess communications. We are developing a model which considers
the need to satisfy interactive users and other possible remote users or customers by
making scheduling decisions based on process dependencies and interprocess
communications. Our simple proof of concept implementation and experiments show the
effectiveness of this approach in real world applications. Our implementation does not
require any change in the software applications or any special kind of configuration in
the system. Moreover, it does not require any additional information about the CPU needs
of applications or other resource requirements. Our experiments show significant
performance improvement for real world applications, for example an almost constant
average response time for the MySQL database server and a constant frame rate for
mplayer under different simulated load values.

1 Introduction



1.1 Motivation 

Almost all of the current process scheduling algorithms, which are used in modern operating 
systems (OS), have their roots in the classical scheduling paradigms which were developed 
during the 1970's. But today's computers have different types of software loads and user 
demands. It is not important to maximize CPU utilization, as most modern machines, either
desktops or servers, have multiple cores/CPUs and most of the time they have idle CPU
cycles. It is not important to minimize job turnaround time, because most machines are
not running CPU intensive jobs. Most machines have a varying load pattern which depends
on the requests coming from local and/or remote users. Often it may not be important to
maximize system throughput either. Typically throughput is defined in terms of
non-interactive jobs submitted to a machine, while most modern tasks have some form of
interaction with users. What is important is to run what the user wants at the current
moment. A user can
be a human, sitting in front of a desktop machine, or it can be another machine sending a 
request to a server through a network connection. As we are looking at a wider range of users, 
we call them customers. In our view, local or remote customers may send different requests 
to a system. Due to rapid changes in customer demands and requests, the need for CPU 
time among different processes or groups of processes changes rapidly. Some OSs and authors 
have framed part of this problem as the user interactivity problem and have addressed it 
in several ways, but none of them have presented an easy to use solution |H1 El [HI [131 [E] • 
These solutions usually need some form of configuration and/or input from the user, or need
additional information from applications. We elaborate on this in Section 1.2.

We believe a key change in OS resource management is to make the OS aware of what
the applications are doing on top of it. In other words, we think the OS should become
intelligent enough to distinguish different processes and allocate resources, including
CPU, to those processes which need them most. We think the processes that need more
resources are the ones which are externally observable at the time of scheduling. If a
customer is waiting for a response from a process, then we say that this process is
externally observable. If this process is waiting for a service from another process, then
the second process is also externally observable. We believe that, as a first step to make
the OS aware of what is happening in the system, process dependencies and interprocess
communications should be considered. Unfortunately, commodity OSs do not support process
dependency detection or interprocess communication detection. Although the OS kernel
usually has some information about interprocess communications and process dependencies,
this information is generally dispersed in various unrelated kernel data structures, and
the kernel does not use it to make process scheduling decisions or any other resource
allocation decisions.

In this study we are developing a model which considers the need to satisfy interactive 
users and other possible remote users or customers. This model makes scheduling decisions 
based on process dependencies and interprocess communications. We want to develop a 
scheduling algorithm which tries to minimize a user's dissatisfaction or unhappiness. We
call this customer appeasement, as it is not possible to satisfy every customer, especially
under heavy loads, by running all processes fairly. A scheduling policy resulting from the
customer appeasement model is not a fair scheduling policy, as it tries to find the more
important processes and give them higher priority. This goal is achieved by a model which
tracks process dependencies and communications using scalar values assigned to processes,
customers and the whole system. Our simple proof of concept implementation and experiments
show the effectiveness of this approach in real world applications. Our implementation does
not require any specific change in the software applications or in the configuration of the
system. Moreover, it does not require any additional information about the CPU needs of
applications or other resource requirements.

1.2 Related Work 

As mentioned in Section 1.1, as far as we know there are many studies which try to solve
the scheduling problems of interactive or multimedia applications, but none of them takes
the broader view of finding the optimal scheduling solution, based on well-defined
criteria, for all applications, especially under heavy load.

Most commodity OSs use heuristics based on process execution/sleeping behavior to detect
interactive processes, increase their priority and reduce their latency. Windows [13] and
FreeBSD [9] use multi-level feedback queue schedulers. In this scheme CPU-bound processes
receive lower priorities and processes blocked waiting for I/O receive higher priorities.
The Linux Vanilla or O(1) scheduler [11] (used in kernels before 2.6.23) has a similar
mechanism: processes with longer sleep times and shorter execution times are identified as
interactive and receive higher priorities. Windows [13] adds more intelligence by
differentiating processes waiting on different devices. For example, processes waiting on
the keyboard receive higher priority than those waiting on a disk. Etsion et al. [8] and
Yan et al. [17] show that depending only on execution behavior is not adequate to
distinguish interactive processes properly. Ingo Molnar, the designer of the Linux CFS
scheduler [10] [12], tries to mitigate this problem by not depending too much on process
execution/sleeping behavior. The CFS scheduler no longer changes the priority of
interactive processes; it only inserts an interactive process at the front of the run
queue every time it wakes up [10] [12] (also see Section 3.3).

Windows [13] also uses "windows system input focus" as a measure of user interaction
and increases the priority of the process which has the input focus. Using input focus may
help to improve interactive performance but has several problems. If a user is running
multiple interactive programs, for example an audio player and a web browser, then while
the user is browsing the web and the input focus is on the web browser, the user still
wants the audio player to play the music well. The input focus mechanism also might not be
useful if a user interacts with the system through the network.

Etsion et al. [8] use process display output production as a means of detecting interactive
and multimedia applications. They schedule processes based on their display output
production in such a way that all processes have a chance to produce display output at the
same rate. That might be useful for multimedia applications where, for example, all video
applications play at the same frame rate regardless of their window size. This approach
only addresses desktop applications, as a network user has no display access. Also, it is
possible that a compute intensive job creates a huge amount of display output and receives
an increase in its priority although it is not actually an interactive application.

Some researchers and OSs allow real time or interactive processes to specify their CPU
requirements and time constraints. For example, in Mac OS X [13] a real time process may
ask for a specific CPU requirement. Yang et al. (RedLine) [18] use almost the same
principles and treat interactive processes like real time processes. In RedLine, processes
can ask for specific CPU and other resource requirements. RedLine also has an admission
mechanism which may not allow a process to execute as an interactive process if the system
does not have the resources requested by the process.

Zheng et al. have an implementation called SWAP [19] which recognizes process
dependencies, but it does not distinguish interactive processes or any other type of
process which might need increased priority. It only tracks process dependencies based on
system calls and prevents a high priority process from being blocked by a low priority
process that has locked a resource needed by the high priority process (the priority
inversion problem).

The work of Zheng et al. called RSIO [20] has the most similarities with our work. RSIO
looks at process I/O patterns as a way of detecting interactive processes. It also tries to
identify other processes involved in a user activity and provides a scheduling policy to
improve interactive performance. This policy is based on access patterns to I/O devices.
RSIO needs a configuration file that defines which I/O devices should be monitored to
detect interactive processes. It also has a relatively complicated heuristic mechanism to
detect processes involved in a user interaction.

1.3 Contributions 

The work presented in this paper has major differences from all of the previous work. 

1. We develop a model (the customer appeasement model) with a criterion which tracks
process dependencies and customer requests. This model gives us the ability to compare
different schedulers analytically, and to develop new scheduling policies based on the
analytical results.

2. Our model considers all of the process communications and dependencies in the system
which are indications of a customer request. Other systems, e.g. RSIO [20], typically
consider a subset of communications related to a subset of I/O devices.

3. The objective of the customer appeasement model is to improve the performance of any
externally observable process, especially under heavy background loads. This includes
traditional interactive processes, i.e. desktop and multimedia applications.

4. We have a simple proof of concept implementation which does not need any configuration
file or process specification information. It does not require any changes in the software
applications either. Automatically, and without any user assistance, it detects those
processes which need more resources as defined in this paper, and increases their
priority. This is in contrast to other work such as RedLine [18] and RSIO [20], or the
mechanism allowed by OS X [13], which require some form of configuration or process
resource specification from the user.

5. Our experimental results show significant performance improvements for both interactive
applications and server processes such as the Apache web server [1] and MySQL [4] database
servers. For example, we observed almost constant average MySQL response times, and an
almost constant frame rate for mplayer, under different (simulated) background loads.

6. One of our goals is to make the implementation simple, easily portable to different
Linux kernels and distributions, and easy to use for a novice user. In order to achieve
this, we use SystemTap [6]. SystemTap is a diagnostic tool, but it has made our
implementation a simple script which can be run on any SystemTap equipped distribution
with a compatible kernel, without the need to recompile and install a new kernel. Our
script has been tested on kernel versions 2.6.31, 2.6.32 and 2.6.35, but should be
compatible with any kernel that has a recent version of the CFS scheduler.

The rest of this paper is organized as follows. We describe the customer appeasement
model and its basic definitions in Section 2. In Section 3 we compute the unhappiness values
for two simple scenarios for some of the classical and modern OS schedulers. In Section 4 we
propose an algorithm which uses unhappiness values to change process scheduling. In
Section 5 we explain our simplified request based priority elevation technique. We give a
more detailed explanation of the implementation in Section 6. In Section 7 we present our
experimental results, and concluding remarks are presented in Section 8.

2 The Customer Appeasement Model 

In this section we introduce the customer appeasement model and explain the parameters 
and variables in detail. 

2.1 Definitions 

In the customer appeasement model we use the following terms, notations and definitions: 

Process: A process P is any software entity inside the system which can be scheduled to run 
on a CPU by the OS. (This definition includes tasks, threads or light weight processes.) 

Customer: A customer C is any outside entity which can send requests to processes in the 
system. Customers are independent from each other. We may distinguish between 
local and remote customers. 

Direct Request: A request $R$ is any type of input from a customer to a process. $R_{i,j}^k$
denotes the $k$th request from customer $C_i$ to process $P_j$.

Indirect Request: A process may receive a request from a customer indirectly. This happens
when a process which has received a direct request from a customer in turn sends
a service request to another process.

Weight of a request: $w_i^k$ is the weight or importance of the request $R_{i,j}^k$ for the
customer $C_i$. This may be measured or inferred from the customer's behavior.

Customer weight: $W(C_i)$ is the parameter which is used to distinguish between different
customers. It represents the weight or importance of customer $C_i$ for the system.

Unhappiness: $u_{i,j}^k$ is an integer value related to the time delay that customer $C_i$
experiences as a result of sending the request $R_{i,j}^k$ to the process $P_j$.

2.2 Computation of Unhappiness Value 

The amount of unhappiness assigned to a process due to a request changes according to the
rules explained in this section. The request might have been sent to $P_j$ either directly
or indirectly through another process. In the simplest situation, the unhappiness for
process $P_j$ at any moment is defined as the elapsed time since the moment process $P_j$
received $R_{i,j}^k$ minus the amount of CPU time that $P_j$ has been allocated. When $P_j$
sends a response back to the customer $C_i$, $u_{i,j}^k$ is set to zero. Observe that:

1. The unhappiness value $u_{i,j}^k$ increases as time passes.

2. $u_{i,j}^k$ is decreased by the amount of time that process $P_j$ runs on a CPU processing
request $R_{i,j}^k$.

3. When process $P_j$ requests a service from another process $P_s$ and blocks, then
$u_{i,j}^k$ is divided between $P_j$ and $P_s$ as follows:

$$u_{i,s}^k = (1 - \alpha)\, u_{i,j}^k \qquad (1)$$

where $0 \le \alpha < 0.5$ is a system parameter which determines the amount of unhappiness
passed to a service process when an indirect request is sent to such a process. Its exact
value should be determined experimentally for a specific implementation.

4. The new unhappiness value $u_{i,j}^{*k}$ for $P_j$ does not change while process $P_j$ is
blocked waiting for a service from other processes. But $u_{i,s}^k$ will increase with time
and in general follows the same rules for unhappiness computation.

5. When the service process $P_s$ finishes its processing and returns a response to $P_j$,
effectively unblocking $P_j$ by giving the requested service to it, the value of $u_{i,s}^k$
is passed back to $P_j$ and is added to its previous unhappiness value. We call this new
unhappiness value $u_{i,j}^{*k}$. At this time the unhappiness value assigned to the service
process is reset to zero:

$$u_{i,s}^k = 0 \qquad (2)$$
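
The five rules above can be read as a small bookkeeping procedure. The following Python sketch is only an illustration of that reading, not the paper's implementation: the class and function names are hypothetical, waiting is modelled as adding to the value and running as subtracting from it (the accounting used in the Round-Robin analysis of Section 3.1), and the blocked client is assumed to keep the share $\alpha$ that is not handed to the service process.

```python
from dataclasses import dataclass


@dataclass
class ProcessState:
    """Unhappiness carried by one process for a single pending request."""
    name: str
    unhappiness: float = 0.0


def wait(p: ProcessState, dt: float) -> None:
    # Rule 1: unhappiness grows while the process waits for the CPU.
    p.unhappiness += dt


def run(p: ProcessState, dt: float) -> None:
    # Rule 2: CPU time spent on the request reduces unhappiness.
    p.unhappiness -= dt


def delegate(client: ProcessState, server: ProcessState, alpha: float) -> None:
    # Rule 3 / Equation (1): the server receives (1 - alpha) of the client's value.
    # Assumption: the blocked client keeps the remaining alpha share (Rule 4).
    if not 0.0 <= alpha < 0.5:
        raise ValueError("alpha is expected to lie in [0, 0.5)")
    total = client.unhappiness
    server.unhappiness = (1.0 - alpha) * total
    client.unhappiness = alpha * total


def complete_service(client: ProcessState, server: ProcessState) -> None:
    # Rule 5 / Equation (2): hand the server's value back, then reset the server.
    client.unhappiness += server.unhappiness
    server.unhappiness = 0.0


def respond(p: ProcessState) -> None:
    # A response to the customer clears the unhappiness of the request.
    p.unhappiness = 0.0


def demo() -> tuple:
    """Walk one request through a client and a service process."""
    client, server = ProcessState("P_j"), ProcessState("P_s")
    wait(client, 4.0)                      # waits in the run queue
    run(client, 1.0)                       # partially processes the request
    delegate(client, server, alpha=0.25)   # blocks on the service process
    wait(server, 2.0)
    run(server, 1.0)
    complete_service(client, server)       # service returned, client unblocked
    return client.unhappiness, server.unhappiness
```

With the hypothetical timings in `demo`, the client ends with a value of 4.0 and the service process with 0.0, showing that delegation moves unhappiness between processes without losing it.
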
We can compute the total unhappiness for a request, for a customer ($U^{C_i}$), or for the
whole system ($U$). The total unhappiness for a request $R_{i,j}^k$ is computed as the
following summation, which covers the unhappiness that customer $C_i$ experiences as a
result of sending request $R_{i,j}^k$ to process $P_j$ and all other delays that are caused
as process $P_j$ waits for services from other processes.

$$U^{R_i^k} = W(C_i)\, w_i^k \sum_{j=0}^{N-1} u_{i,j}^k \qquad (3)$$

The total unhappiness for customer $C_i$ is computed using Equation 4. In this equation
$R$ is the total number of requests sent by $C_i$ and $N$ is the total number of processes
in the system:

$$U^{C_i} = W(C_i) \sum_{j=0}^{N-1} \sum_{k=0}^{R-1} w_i^k\, u_{i,j}^k \qquad (4)$$

The system unhappiness due to all requests from all customers is computed using
Equation 5, where $M$ is the total number of customers:

$$U = \sum_{i=0}^{M-1} W(C_i) \sum_{j=0}^{N-1} \sum_{k=0}^{R-1} w_i^k\, u_{i,j}^k \qquad (5)$$

The objective of the scheduling algorithm should be to minimize the system's unhappiness 
U at any time. 
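
As an illustration of how the per-request, per-customer and system totals nest, the sketch below aggregates per-process unhappiness values in Python. It assumes that each request carries its own weight and that the per-process values of a request simply add up; the function names and the nested list layout are hypothetical and chosen only for this example.

```python
def request_unhappiness(customer_weight, request_weight, per_process):
    """Total for one request: weighted sum over the processes serving it."""
    return customer_weight * request_weight * sum(per_process)


def customer_unhappiness(customer_weight, requests):
    """Total for one customer.

    `requests` is a list of (request_weight, [unhappiness per process]) pairs.
    """
    return sum(
        request_unhappiness(customer_weight, weight, values)
        for weight, values in requests
    )


def system_unhappiness(customers):
    """Total for the system.

    `customers` is a list of (customer_weight, requests) pairs.
    """
    return sum(customer_unhappiness(w, reqs) for w, reqs in customers)


# Hypothetical snapshot: two customers, three pending requests in total.
snapshot = [
    (2.0, [(1.0, [3, 1]), (0.5, [4])]),  # an important local customer
    (1.0, [(1.0, [5])]),                 # a remote customer
]
```

For this snapshot the first customer contributes 2.0 * (1.0 * 4 + 0.5 * 4) = 12 and the second contributes 5, so the scheduler would be trying to reduce a system value of 17.
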

3 Unhappiness Values in Different Scheduling Algorithms

In order to find out how some of the scheduling algorithms perform under the customer 
appeasement model, we compute the unhappiness value that they cause for a request sent 
by a customer to a system in two specific scenarios. We compute the unhappiness value for 
the following schedulers, and simplify final values as much as possible so that the results are 
comparable. 

3.1 Round-Robin Scheduling 

The Round-Robin (RR) scheduler is a simple preemptive scheduling algorithm which was
used in time sharing systems [15]. It is still used as part of some modern scheduling
algorithms; for example, it is part of the Linux real time (RT) scheduling class. The
Round-Robin scheduler gives each process a time slice or time quantum $q$. If a process
releases the CPU before $q$ is finished, the scheduler runs the next process in the ready
queue. If a process needs more time and finishes its time quantum, the scheduler preempts
the process, inserts it at the tail of the ready queue, and schedules the next process from
the head of the ready queue. This new running process also receives a time quantum $q$.

Now we consider a simple case and compute the minimum and a typical unhappiness
value caused by a request in a system with an RR scheduler. The minimum and typical
unhappiness values occur in the best case and a typical case scenario respectively.
We assume that there are $N - 1$ running processes in the run queue and that there
is a process $P_j$ in the sleeping state waiting to receive a request. There is only one
customer, and the system parameter $\alpha$ is set to zero ($\alpha = 0$). This customer
sends a request to the sleeping process $P_j$ and waits for the response. The OS wakes up
process $P_j$ and inserts it at the end of the ready queue. Assume that the $N - 1$ running
processes stay running all the time and use all of their time quanta, so $P_j$ waits in the
ready queue for $q(N - 1)$ seconds before it runs on the CPU. If it can finish processing
and return a response to the customer during its first time slice $q$, then $q(N - 1)$ is
the amount of unhappiness during this transaction. If it cannot return a response to the
customer during this time and needs a total of $Z$ time quanta to finalize this transaction
and return a response to the customer, then the maximum unhappiness will be:

$$U^{R} = Zq(N - 1) - (Z - 1)q = q(Z(N - 2) + 1) \qquad (6)$$

In practice it is possible that process $P_j$ blocks and waits for services from other
processes. Assume that it waits for a service from process $P_s$ after $Z_1$ time quanta,
and $P_s$ needs $Z_2$ time quanta to process $P_j$'s request and return a response. $P_j$
may also need another $Z_3$ time quanta to return a response to the customer. We assume
that $P_s$ is in the sleeping state prior to receiving a request from $P_j$; that means
there are always $N$ running processes, because when $P_j$ blocks and sleeps, $P_s$ wakes
up and is in the running state. Then the amount of unhappiness experienced by the customer
due to the request $R$ will be:

$$U^{R} = q(Z_1(N - 2) + 1) + q(Z_2(N - 2) + 1) + q(Z_3(N - 2) + 1)$$
$$= q((Z_1 + Z_2 + Z_3)(N - 2) + 3) \qquad (7)$$

Here the unhappiness value consists of three terms. The first term is the aggregated
unhappiness caused by delays in the execution of $P_j$ up to the time it blocks waiting on
$P_s$. The second term is the amount of unhappiness caused by delays during the execution
of $P_s$, and the last term reflects the delays of running $P_j$ after it receives the
response from $P_s$ until it sends the final response to the customer. So the best case and
a typical case scenario with a Round-Robin scheduler result in the following minimum and
typical unhappiness values:

$$U^{R}_{min} = q(Z(N - 2) + 1) \qquad (8)$$
$$U^{R} = q((Z_1 + Z_2 + Z_3)(N - 2) + 3) \qquad (9)$$

Note that the unhappiness value related to running the service process $P_s$ is set to zero
once it sends the response back to $P_j$.
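
The Round-Robin expressions can be checked numerically. The short Python sketch below evaluates the closed form $q(Z(N - 2) + 1)$ for each phase of a request and sums the phases; the function names and the sample values of $q$, $N$ and $Z$ are assumptions made only for illustration.

```python
def rr_phase_unhappiness(q, n, z):
    """Unhappiness of one uninterrupted phase that needs z time quanta.

    Expanded form: z waits of q*(n-1) each, minus (z-1) quanta of CPU time.
    """
    return z * q * (n - 1) - (z - 1) * q


def rr_unhappiness(q, n, quanta_per_phase):
    """Sum the closed form q*(z*(n-2)+1) over all phases of a request."""
    return sum(q * (z * (n - 2) + 1) for z in quanta_per_phase)


# Hypothetical numbers: 10 ms quantum, 5 runnable processes.
best_case = rr_unhappiness(10, 5, [3])        # a single phase of 3 quanta
typical = rr_unhappiness(10, 5, [1, 2, 1])    # client, service, client again
```

With these sample values the best case gives 100 ms and the three-phase case gives 150 ms, even though both need four or fewer quanta of CPU: every additional block and wake-up sends the request to the tail of the ready queue again.
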

3.2 Multilevel Feedback Queues



In this subsection we perform the same analysis for a basic multilevel feedback queue
scheduler. Many UNIX OSs such as FreeBSD [9] utilize some form of multilevel feedback queue
scheduler. Windows also has a multilevel feedback queue scheduler [13]. As another example,
the O(1) or Vanilla scheduler in Linux kernels before 2.6.23 is in fact a multilevel
feedback queue with many heuristics involved in moving tasks between different queues and
detecting interactive processes [11].

Assume a basic multilevel feedback queue scheduling algorithm with $m$ queues called $Q_0$
to $Q_{m-1}$. The processes in each queue can run on the CPU for a multiple of the time
quantum $q$. The amount of time for $Q_i$ is computed as $2^i q$. Each process is first
placed in $Q_0$. After it receives its CPU share in $Q_0$, if it needs more time it is
placed in the next queue $Q_1$, and so on. Each queue has absolute priority over the next
queue, meaning that processes in $Q_{i+1}$ do not execute until all processes in $Q_i$
receive their CPU share and $Q_i$ becomes empty. So in this algorithm, processes which need
more CPU time lose their priority as time passes but receive a larger time quantum when
they run on the CPU.

Now assume an interactive process wakes up and receives a request from a customer. Also
assume that there are a total of $a_i$ processes in $Q_i$. Assume that the interactive
process needs $Zq$ processing time to finish processing and return a response to the
customer, and no other processes will enter the running queues during this time. This means
that the interactive process is inserted at the end of $Q_0$, receives its CPU time after
waiting for the other processes in this queue, then is pushed to the next queue, and so on.
Assume $Q_x$ is where it receives the final amount of CPU time that it needs to finish
processing the request and return a response to the customer. The amount of unhappiness
that the customer experiences is computed as:

$$U^{R}_{min} = (a_0 - 1)q + 2(a_1 - 1)q + \dots + 2^x(a_x - 1)q - (q + 2q + \dots + 2^{x-1}q)$$
$$= q\left(\sum_{i=0}^{x} 2^i(a_i - 1) - \sum_{i=0}^{x-1} 2^i\right) \qquad (10)$$

In this equation the first summation indicates the delays that the interactive process
encounters waiting in ready queues, and the second summation indicates the amount of CPU
time it has received. We can compute $x$ by solving the equation $\sum_{i=0}^{x} 2^i = Z$
and then simplify the minimum unhappiness value, using $\sum_{i=0}^{x-1} 2^i = 2^x - 1$, as
follows:

$$x = \log_2(Z + 1) - 1 \qquad (11)$$

$$U^{R}_{min} = q\left(\sum_{i=0}^{\log_2(Z+1)-1} 2^i(a_i - 1) - 2^{\log_2(Z+1)-1} + 1\right) \qquad (12)$$

We can also examine a more complicated scenario, as in the previous subsection. Assume
that $P_j$ wakes up and receives a request. It then spends $Z_1 q$ seconds to partially
process this request and sends a request to a service process $P_s$. Now $P_s$ needs
$Z_2 q$ seconds to provide the service to $P_j$. After $P_j$ receives the service, it needs
another $Z_3 q$ seconds to finalize its processing and return a response to the customer.
Please note that in this scheduler each time a process wakes up, it is inserted at the end
of $Q_0$. We compute unhappiness values assuming that the system parameter $\alpha$ is set
to zero ($\alpha = 0$, see Section 2.2):

$$U^{R} = q\left(\sum_{i=0}^{\log_2(Z_1+1)-1} 2^i(a_i - 1) - 2^{\log_2(Z_1+1)-1} + 1\right)$$
$$+ q\left(\sum_{i=0}^{\log_2(Z_2+1)-1} 2^i(a_i - 1) - 2^{\log_2(Z_2+1)-1} + 1\right)$$
$$+ q\left(\sum_{i=0}^{\log_2(Z_3+1)-1} 2^i(a_i - 1) - 2^{\log_2(Z_3+1)-1} + 1\right) \qquad (13)$$

Please note that the total unhappiness value consists of three terms. The first term is
the result of delays during the first part of processing the request by $P_j$. The second
term is caused when the service process $P_s$ is in the run queue, and the last term is
associated with the waiting when $P_j$ prepares the final response for the customer. This
scenario can be a typical situation, while it is possible to have even more complex cases
where $P_j$ may need more services from $P_s$ or other service processes. This leads to
higher unhappiness values.
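
The multilevel feedback queue analysis can also be evaluated directly from its two summations: waiting behind the other processes in queues 0 to x, minus the CPU time already received in queues 0 to x - 1. The Python sketch below does this for one phase of a request. The function names and sample queue sizes are hypothetical, and the sketch assumes, as the derivation does, that the number of quanta needed is of the form 2^(x+1) - 1.

```python
def final_queue_index(z):
    """Solve 1 + 2 + ... + 2**x == z for x, i.e. x = log2(z + 1) - 1.

    Assumption of the analysis: z + 1 is a power of two.
    """
    if z < 1 or (z + 1) & z != 0:
        raise ValueError("z + 1 must be a power of two")
    return (z + 1).bit_length() - 2


def mlfq_unhappiness(q, queue_sizes, z):
    """Unhappiness for one phase needing z quanta.

    queue_sizes[i] is the number of processes in queue i (the a_i of the text).
    """
    x = final_queue_index(z)
    waiting = sum((2 ** i) * (queue_sizes[i] - 1) for i in range(x + 1))
    cpu_received = sum(2 ** i for i in range(x))
    return q * (waiting - cpu_received)


# Hypothetical load: 3, 2 and 4 processes in the first three queues, q = 10 ms.
sizes = [3, 2, 4]
single_phase = mlfq_unhappiness(10, sizes, 7)
# A request that blocks twice restarts from queue 0 for every phase.
three_phases = sum(mlfq_unhappiness(10, sizes, z) for z in (1, 3, 3))
```

For these sample values a single phase of 7 quanta yields 130 ms, while splitting the same 7 quanta into three phases yields 80 ms, because each wake-up re-enters the high priority queue $Q_0$ instead of sinking into the crowded lower queues.
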



3.3 Linux CFS Scheduler 

The Completely Fair Scheduler, or CFS for short [10] [12], was introduced in Linux kernel
version 2.6.23. As its developer explains [10] [12], "it is designed to basically model an
ideal, precise multi-tasking CPU on real hardware".

It always tries to share the CPU fairly between the current processes in the run queue.
In other words, if there are $N$ processes in the run queue, it promises to run each process
with $1/N$ of the CPU power. In order to achieve this goal, the CFS scheduler keeps track of
a variable called virtual run time (vruntime) for each task. It is a weighted run time for
each task. CFS uses a red-black (R-B) tree to choose the next task to run on the CPU. It
simply chooses the leftmost task in the tree, which has the lowest vruntime value. If a new
process enters the run queue, CFS manipulates its vruntime value such that the newly
arriving process goes to the right of the R-B tree. This is to make sure that it can keep
its promised wait time to the currently running processes. CFS also gives an advantage to
sleeping processes. If a process sleeps less than a threshold time interval, then CFS
changes its vruntime value such that it goes to the leftmost position in the R-B tree when
it wakes up. Assuming that an interactive task is a short sleeper, this will lead to better
response times for interactive tasks. CFS does not have a fixed time slice. At the time it
runs the next task it gives the task a time slice which is computed as follows:

$$Timeslice = q(N) = \frac{sch\_lat}{N} \qquad (14)$$

In this equation $sch\_lat$ is a CFS constant value and $N$ is the number of tasks in the
run queue. CFS stretches this time slice if the number of running tasks increases beyond a
system threshold. $q(N)$ is the basic time slice in CFS if process weights and nice values
are ignored.

Another interesting property of the CFS scheduler is the way nice values work. Nice values
change the weight of a task, which means that they change the vruntime of a task. If task
weights and nice values are considered, then CFS changes the CPU share of each task based
on the following relations:

$$q(P_i, N) = \frac{w_i \cdot sch\_lat}{\sum_{j=0}^{N-1} w_j} \qquad (15)$$

$$w_{nice_0} = 1024$$
$$w_{nice_{i-1}} = 1.25\, w_{nice_i}$$

$w_i$ in Equation 15 is the weight of $P_i$ assigned by CFS, and changes directly based on
$P_i$'s nice value. For example, if there are two tasks $A$ and $B$ running on a single CPU
machine and both have the default nice value of 0, then each receives 50% of the total CPU
time. If task $A$'s nice value is changed to $-1$, then it receives 55% of the CPU time and
task $B$ receives 45% of the CPU time.
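
The weight rule and the resulting CPU shares can be reproduced with a few lines of Python. This is a sketch of the relations stated above, not of the kernel code: the function names are hypothetical, the 20 ms latency value is an arbitrary example, and the closed form 1024 * 1.25^(-nice) is used in place of the precomputed weight table of the Linux kernel, whose entries differ slightly from it.

```python
NICE_0_WEIGHT = 1024  # weight of a task with the default nice value 0


def nice_weight(nice):
    """Each step down in nice value multiplies the weight by 1.25."""
    return NICE_0_WEIGHT * 1.25 ** (-nice)


def cfs_timeslices(sch_lat, nice_values):
    """Split the scheduling latency among tasks in proportion to their weights."""
    weights = [nice_weight(n) for n in nice_values]
    total = sum(weights)
    return [sch_lat * w / total for w in weights]


# Two tasks A and B on one CPU, hypothetical latency of 20 ms.
equal = cfs_timeslices(20.0, [0, 0])      # both at nice 0
boosted = cfs_timeslices(20.0, [-1, 0])   # A at nice -1, B at nice 0
share_a = boosted[0] / sum(boosted)
```

With both tasks at nice 0 each receives half of the latency period. With task A at nice -1 its weight becomes 1280, so its share is 1280 / 2304, roughly 55%, and task B receives the remaining 45%, which matches the example in the text.
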

Now we compute the unhappiness values experienced by a customer in a system with the
CFS scheduler. Assuming that all running processes have their default nice value of zero,
we have:

$$q(P_i, N) = \frac{1024 \cdot sch\_lat}{1024\, N} = \frac{sch\_lat}{N} = q(N) \qquad (16)$$

Assume that there are $N - 1$ running processes in the run queue and an interactive
task $P_j$ is sleeping. Assume a customer sends request $R$ to $P_j$. $P_j$ wakes up and
enters the run queue. Now we compute the minimum unhappiness that may be experienced by
the customer. In the best case scenario, it is possible that CFS changes the vruntime of
$P_j$ such that it is placed in the leftmost leaf of the R-B tree and it may then preempt
the currently running process. Please note that the number of running processes is now $N$.
Assume that $P_j$ needs $\tau$ seconds to finish processing the request. If $\tau < q(N)$
then $P_j$ can finish processing the request without interruption and return a response to
the customer, and the total unhappiness value is zero:

$$U^{R}_{min} = 0 \qquad (17)$$

Now we assess a typical scenario where $\tau > q(N)$ and the interactive task blocks to
receive a service from service task $P_s$. We assume that the system parameter
$\alpha = 0$, so when $P_j$ blocks, it transfers all of its unhappiness to $P_s$. At this
time $P_s$ enters the run queue, so the total number of running jobs does not change.
Additionally we assume that no other task enters or leaves the run queue while the request
$R$ is being serviced. So the total number of running tasks is always $N$. Assume that
$P_j$ needs $\tau_1$ seconds for processing the request $R$ before blocking and requesting
service from $P_s$, $P_s$ needs $\tau_2$ seconds to return the requested service to $P_j$,
and $P_j$ needs another $\tau_3$ seconds to return a response to the customer. To simplify
the computation we use $\tau = \tau_1 + \tau_2 + \tau_3$ and write $q$ for $q(N)$. The
total unhappiness experienced by the customer due to this request is:

$$U^{R} = q(N-1)\left(\frac{\tau_1}{q} - 1\right) - \tau_1 + q(N-1)\left(\frac{\tau_2}{q} - 1\right) - \tau_2 + q(N-1)\left(\frac{\tau_3}{q} - 1\right) - \tau_3$$
$$= q(N-1)\left(\frac{\tau_1 + \tau_2 + \tau_3}{q} - 3\right) - (\tau_1 + \tau_2 + \tau_3)$$
$$= q(N-1)\left(\frac{\tau}{q} - 3\right) - \tau \qquad (18)$$

4 Scheduling Based on Unhappiness Values 

Up to this point, we have developed a model which can give us an indication of how an
OS performs regarding the requests it receives from different customers. Interestingly, the
way we have defined unhappiness and compute its value, and the way it is inherited by
processes, highlights the dependency between processes which are responsible for a
particular request. We can look at this as a way of coloring the process dependency
subgraph which is involved in responding to a particular request, and finding the process
which creates the most unhappiness at the current time. The objective is to minimize system
unhappiness. The very first idea to achieve this is to find the request which creates the
maximum aggregated unhappiness and allocate the CPU to the processes responsible for
serving that request. In theory we can achieve this by performing the following steps.

1. We assume that there are two queues in the system: one for unhappy processes, which
is called $Q_0$, and a second queue for regular processes, which is called $Q_1$.

2. All processes with nonzero unhappiness values are placed in $Q_0$ and processes with
zero unhappiness values are placed in $Q_1$.

3. $Q_0$ has absolute priority over $Q_1$, so there should be no processes in $Q_0$
before processes in $Q_1$ can be executed.

4. In order to determine which process to execute next from $Q_0$, the scheduler first
computes the unhappiness values for all pending requests in the system.

5. Based on the requests' unhappiness values computed in the previous step, the scheduler
chooses the request with the highest unhappiness value.

6. The scheduler then checks all processes responsible for that request. In other words,
it checks the dependency subgraph of the processes that are servicing that particular
request and chooses the process with the highest unhappiness value to run on the CPU.

7. Processes in $Q_1$ are executed based on a regular system scheduling algorithm, for
example the Linux CFS scheduling algorithm.

8. When a new process is created it should be given a nonzero unhappiness value so that
it has a chance to start faster; then, if it does not serve requests, it is moved to $Q_1$.
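
The selection steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scheduler evaluated in this paper: the names Process, Request and pick_next are hypothetical, and the aggregated unhappiness of a request is assumed here to be the sum over the processes currently serving it.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    pid: int
    unhappiness: float = 0.0  # nonzero means the process belongs in Q0


@dataclass
class Request:
    name: str
    processes: list = field(default_factory=list)  # dependency subgraph

    def unhappiness(self):
        # Assumption: aggregate by summing over the serving processes.
        return sum(p.unhappiness for p in self.processes)


def pick_next(requests, q1, fallback):
    """Return the process to run next, following steps 3 to 7."""
    pending = [r for r in requests if r.unhappiness() > 0]
    if pending:  # Q0 is non-empty, so it takes precedence over Q1 (step 3)
        worst = max(pending, key=lambda r: r.unhappiness())       # step 5
        return max(worst.processes, key=lambda p: p.unhappiness)  # step 6
    return fallback(q1) if q1 else None                           # step 7


x_server = Process(1, unhappiness=0.4)
browser = Process(2, unhappiness=0.1)
database = Process(3, unhappiness=0.3)
compiler = Process(4)  # zero unhappiness, so it lives in Q1

requests = [Request("click", [x_server, browser]), Request("query", [database])]
chosen = pick_next(requests, [compiler], fallback=lambda q: q[0])
print(chosen.pid)
```

The fallback argument stands in for the regular scheduler of step 7; here it simply takes the first process of the regular queue.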

5 Request Based Priority Elevation for CFS

Unfortunately, implementing even the simplest version of the algorithm proposed in Section 4
requires that the scheduler detect all requests to all processes, and also that it detect
responses from processes to customers. At present, detecting a response from a process to
a particular request seems to be impossible without process cooperation. Because of
these difficulties, and in order to create a proof of concept implementation, we decided
to take a simplified approach. We began by observing a typical Linux desktop which was
also configured as a small web server. This system ran Linux kernel version 2.6.31 and
hence used the CFS scheduler. We used SystemTap [6] and strace [S] to trace system calls
and interprocess communications. We observed that most of the requests from a desktop
user are passed to processes through UNIX sockets. Desktop components also mostly use
UNIX sockets to communicate with each other [2] [13]. Requests from remote customers
enter the system through network sockets. Based on these observations we propose a simple
priority elevation technique for Linux kernels with the CFS scheduler to approximately minimize
system unhappiness as defined in Section 2.

1. Assume that each process receives its incoming requests through a non-zero socket
read.

2. Since a pending request increases system unhappiness, whenever a process receives a
request (a non-zero socket read), the scheduler should increase its priority or CPU share.

3. The process should be able to maintain its elevated priority until it sends a response
to the customer. However, as we cannot detect the exact time at which it sends a response back
to the customer, the scheduler decays the elevated priority over time.

4. In order to minimize interaction with regular system scheduling, we change the amount
of priority elevation and the decay speed based on the system load. As the system load
increases, an eligible process receives higher priority and retains this higher priority for
a longer time. This is based on the fact that when a Linux OS with the CFS scheduler has
a higher load, each process receives a smaller share of the CPU time [10] [12]. So, in
order for a process that receives a request to be able to respond to the request as
if there were little or no load on the system, it should receive a larger share of CPU time
relative to other processes. By increasing its priority more aggressively under heavier
loads, we allocate more CPU time to such a process; as a result, it has a better chance
to finish its computation and return a response to the request in a shorter time interval.

In the rest of this paper we refer to this method by its abbreviation, CFS/RBPE.

5.1 Unhappiness Values for CFS/RBPE

In this subsection we compute theoretical unhappiness values for the priority elevation technique,
as we did for other schedulers in Section 3. Again we consider two scenarios under this
scheme and compute the amount of unhappiness observed by the customer. The assumptions
are mostly the same as in Section 3.3. There are $N - 1$ tasks in the
run queue of a single processor machine. Task $P_j$ is an interactive job which is sleeping. A
customer sends a request $R$ to $P_j$. It wakes up and is inserted in the run queue. Assume
that the priority elevation mechanism detects the request and uses a negative nice value $-nice$,
such that $-15 \le -nice \le 0$, to increase $P_j$'s priority. Assume that the elevated priority decays
at a speed of 1 nice level every $d$ seconds. $P_j$ needs $\tau$ seconds to finish its computation
and return a response to the customer, and $d < \tau$. As we explained in Section 3, each
negative nice level increases the task's weight to 1.25 times its previous value. So if there
are $N$ tasks including $P_j$ in the run queue and $P_j$ has a negative nice value $-nice$, then the
following relations hold:

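\[
q(P_j, N) = \frac{1.25^{nice}\,\mathrm{sch\_lat}}{N - 1 + 1.25^{nice}},
\qquad
q(P_i, N) = \frac{\mathrm{sch\_lat}}{N - 1 + 1.25^{nice}} \quad (i \ne j)
\qquad (19)
\]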

Assume $\tau \le \frac{1.25^{nice}\,\mathrm{sch\_lat}}{N - 1 + 1.25^{nice}}$. Then, as CFS enqueues an interactive task at the left of the
R-B tree, it is possible that $P_j$ finishes its computation and returns a response to the customer
within this time period. This means that it is possible that the customer observes zero
unhappiness:

\[
U_{min} = 0 \qquad (20)
\]
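
The slice sizes used above follow from CFS dividing one scheduling period among the runnable tasks in proportion to their weights. As a sketch, assuming the $N - 1$ other tasks all run at nice 0 with a common weight $w_0$ (the symbol $w_0$ is introduced here only for illustration):

```latex
\[
q(P_j, N)
  = \mathrm{sch\_lat} \cdot \frac{1.25^{nice}\, w_0}{(N - 1)\, w_0 + 1.25^{nice}\, w_0}
  = \frac{1.25^{nice}\,\mathrm{sch\_lat}}{N - 1 + 1.25^{nice}}
\]
```

With $nice = 0$ this reduces to $\mathrm{sch\_lat}/N$, the slice that every task receives under plain CFS.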

Now, let us consider a typical scenario where $\tau > \frac{1.25^{nice}\,\mathrm{sch\_lat}}{N - 1 + 1.25^{nice}} = q$ and $P_j$ also needs
a service from process $P_s$. We also assume that $P_j$ first requires $\tau_1 \approx z_1 q$ seconds before it
blocks for service from $P_s$, that $P_s$ requires $\tau_2 \approx z_2 q$ to provide service to $P_j$, and that eventually $P_j$
requires another $\tau_3 \approx z_3 q$ to return a response to the customer. Also assume that $d \approx q$,
so the nice levels of $P_j$ and $P_s$ increase almost every time they are rescheduled after being set to $-nice$
by the scheduler. For simplicity we also assume that $z_1, z_2, z_3 \le nice + 1$. Based on these
assumptions the amount of unhappiness observed by the customer is:

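\[
\begin{aligned}
U ={}& \sum_{i=0}^{z_1-2} \frac{(N-1)\,\mathrm{sch\_lat}}{N - 1 + 1.25^{(nice-i)}}
      - \sum_{i=1}^{z_1-1} \frac{1.25^{(nice-i)}\,\mathrm{sch\_lat}}{N - 1 + 1.25^{(nice-i)}} \\
   &+ \sum_{i=0}^{z_2-2} \frac{(N-1)\,\mathrm{sch\_lat}}{N - 1 + 1.25^{(nice-i)}}
      - \sum_{i=1}^{z_2-1} \frac{1.25^{(nice-i)}\,\mathrm{sch\_lat}}{N - 1 + 1.25^{(nice-i)}} \\
   &+ \sum_{i=0}^{z_3-2} \frac{(N-1)\,\mathrm{sch\_lat}}{N - 1 + 1.25^{(nice-i)}}
      - \sum_{i=1}^{z_3-1} \frac{1.25^{(nice-i)}\,\mathrm{sch\_lat}}{N - 1 + 1.25^{(nice-i)}}
      \qquad (21)
\end{aligned}
\]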

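Assuming the total time needed to return a response to the customer is $\tau$, we have
$\tau = \tau_1 + \tau_2 + \tau_3$. Note that the CPU time that $P_j$ and $P_s$ receive is approximately the total
execution time that they need to return a response to the customer request, so:

\[
\tau \approx \sum_{i=1}^{z_1-1} \frac{1.25^{(nice-i)}\,\mathrm{sch\_lat}}{N - 1 + 1.25^{(nice-i)}}
      + \sum_{i=1}^{z_2-1} \frac{1.25^{(nice-i)}\,\mathrm{sch\_lat}}{N - 1 + 1.25^{(nice-i)}}
      + \sum_{i=1}^{z_3-1} \frac{1.25^{(nice-i)}\,\mathrm{sch\_lat}}{N - 1 + 1.25^{(nice-i)}}
      \qquad (22)
\]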






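And we can write the unhappiness as follows:

\[
U \approx \sum_{i=0}^{z_1-2} \frac{(N-1)\,\mathrm{sch\_lat}}{N - 1 + 1.25^{(nice-i)}}
        + \sum_{i=0}^{z_2-2} \frac{(N-1)\,\mathrm{sch\_lat}}{N - 1 + 1.25^{(nice-i)}}
        + \sum_{i=0}^{z_3-2} \frac{(N-1)\,\mathrm{sch\_lat}}{N - 1 + 1.25^{(nice-i)}}
        - \tau
        \qquad (23)
\]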

Please note that the above calculations are based on the assumption that all communications/requests
between processes are performed using network or UNIX sockets. Based on
our observations this is a valid assumption for most Linux desktop applications/components.
Another fact is that most transactions need more than just one read operation. For example,
a simple click on a link in a web browser causes many UNIX socket read system calls, both
in the web browser and in the X server, before a new page is displayed on the screen. In
effect, this causes the active processes in the transaction (in this case the web browser and the X server) to
receive the $-nice$ value multiple times. This means that in practice they have higher priority
for a longer period of time than what we compute here.

5.2 CFS vs CFS/RBPE 

In this subsection we compare and discuss the unhappiness computations for the CFS scheduler and
CFS with RBPE. As we see in Sections 3.3 and 5.1, the minimum unhappiness values for the
best scenarios in both cases are zero. So, at the very minimum, we can see that in theory our
proposed RBPE scheme does not make the situation worse. There are two
conditions for the best case scenarios with the resulting zero unhappiness. These conditions
are:

\[
\tau \le \frac{\mathrm{sch\_lat}}{N} \qquad \text{for CFS} \qquad (24)
\]

\[
\tau \le \frac{1.25^{nice}\,\mathrm{sch\_lat}}{N - 1 + 1.25^{nice}} \qquad \text{for CFS/RBPE} \qquad (25)
\]

As $\frac{1.25^{nice}\,\mathrm{sch\_lat}}{N - 1 + 1.25^{nice}} \ge \frac{\mathrm{sch\_lat}}{N}$, clearly under CFS/RBPE $P_j$ has more time to respond to the request
before it is preempted than under CFS. So there is a higher possibility under CFS/RBPE that
a request is answered without encountering any unhappiness.
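
To make the gap between conditions (24) and (25) concrete, the short sketch below evaluates both bounds for a few run queue lengths. The function names and the 20 ms scheduling latency are assumptions chosen for illustration; they are not values taken from our experiments.

```python
def cfs_slice(sch_lat, n_tasks):
    """Largest tau that fits in one slice under plain CFS (condition 24)."""
    return sch_lat / n_tasks


def rbpe_slice(sch_lat, n_tasks, nice):
    """Largest tau that fits in the first slice under CFS/RBPE (condition 25).

    `nice` is the magnitude of the negative nice value applied to P_j.
    """
    weight = 1.25 ** nice
    return weight * sch_lat / (n_tasks - 1 + weight)


SCH_LAT_MS = 20.0  # assumed scheduling latency, for illustration only

for n_tasks in (2, 5, 10, 20):
    plain = cfs_slice(SCH_LAT_MS, n_tasks)
    boosted = rbpe_slice(SCH_LAT_MS, n_tasks, nice=5)
    print(f"N={n_tasks:2d}  CFS: {plain:5.2f} ms  CFS/RBPE: {boosted:5.2f} ms")
```

With nice set to 0 the two functions return the same value, which matches the observation that RBPE never shortens the first slice.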

For typical scenarios in both cases again we have: 

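\[
U = (N-1)\left(\frac{N\tau}{\mathrm{sch\_lat}} - 3\right) - \tau \qquad \text{for CFS, and} \qquad (26)
\]

\[
U \approx (N-1)\,\mathrm{sch\_lat}\left(
      \sum_{i=0}^{z_1-2} \frac{1}{N - 1 + 1.25^{(nice-i)}}
    + \sum_{i=0}^{z_2-2} \frac{1}{N - 1 + 1.25^{(nice-i)}}
    + \sum_{i=0}^{z_3-2} \frac{1}{N - 1 + 1.25^{(nice-i)}}
  \right) - \tau \qquad \text{for CFS/RBPE} \qquad (27)
\]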

In the CFS/RBPE case, as a result of the higher priority for $P_j$ and $P_s$, other tasks have less
CPU time, so the resulting value is less than that of the CFS case.

6 Implementation

While we were using SystemTap [6] to observe process/OS interactions and behaviors, we
found it extremely powerful for writing simple scripts which can be used with different kernel
versions with almost no modification. So, in order to implement a simple Request Based
Priority Elevation (RBPE) mechanism as a proof of concept for our customer appeasement,
we decided to use SystemTap in its Guru mode. When used in this mode, SystemTap
enables parsing of expert-level constructs like embedded C. It therefore enables us to
write C code and insert it into the kernel as a kernel module, effectively modifying a running
kernel without directly modifying the kernel source code or recompiling it.

We use SystemTap to create a list of processes (PIDs) that have recently called a socket
related receive/read system call with a nonzero return value. Assuming this call means that
the associated process has received either a direct or an indirect request from a customer, we
increase its priority. The exact negative nice value used to increase the process priority
depends on the system load: the higher the system load, the lower the negative nice value.
The applied negative nice value, along with a time stamp, is saved in a list with the associated
process PID. We call this list the elevated priority process list (EPPL). After a time delay
we increase the nice value of the processes which are in the EPPL, effectively
reducing their priority. The exact value of this time delay also depends on the system load.
During our process behavior observation period, we noticed that the majority of processes
waiting for input use the poll system call periodically. We changed the poll system call
and use it as a point to check the current system status and update the state of the processes
which are in the EPPL. Each time poll is called, we check the EPPL and increase
the nice value by one for each process whose delay time has passed. We then enter the
new nice value with a new time stamp into the EPPL. For each process in the
list, this action continues until the nice value becomes zero; at that point the process is
deleted from the list. We adjust the initial nice values and the delay time based on the
system load during poll system calls. If the system load is very low, the RBPE mechanism does
not interfere with the regular CFS scheduling decisions, but as the system load increases it
interferes more aggressively, as mentioned earlier. Table 1 shows the initial nice values and delay
times at different system loads in the current version (0.5) of our script.
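
The bookkeeping described above can be modeled in a few lines of Python. This is a simplified sketch for illustration and not our SystemTap script: the class ElevatedPriorityList, its method names and the set_nice callback are hypothetical, and a single delay value is passed in by the caller instead of being derived from the system load.

```python
import time


class ElevatedPriorityList:
    """Toy model of the EPPL: pid -> (current nice value, time stamp)."""

    def __init__(self, set_nice, clock=time.monotonic):
        self.entries = {}
        self.set_nice = set_nice  # callback that would renice a process
        self.clock = clock

    def on_socket_read(self, pid, bytes_read, initial_nice):
        # A nonzero socket read is taken as "a request has arrived".
        if bytes_read > 0 and initial_nice < 0:
            self.entries[pid] = (initial_nice, self.clock())
            self.set_nice(pid, initial_nice)

    def on_poll(self, delay_s):
        # Called whenever some process issues poll(): decay expired entries.
        now = self.clock()
        for pid, (nice, stamp) in list(self.entries.items()):
            if now - stamp < delay_s:
                continue
            nice += 1  # one step back toward the default priority
            self.set_nice(pid, nice)
            if nice >= 0:
                del self.entries[pid]
            else:
                self.entries[pid] = (nice, now)


# Usage with a fake clock so the decay can be followed step by step.
fake_now = [0.0]
log = []
eppl = ElevatedPriorityList(lambda pid, n: log.append((pid, n)),
                            clock=lambda: fake_now[0])
eppl.on_socket_read(pid=42, bytes_read=128, initial_nice=-2)
fake_now[0] = 1.0
eppl.on_poll(delay_s=1.0)
fake_now[0] = 2.0
eppl.on_poll(delay_s=1.0)
print(log, eppl.entries)
```

A second socket read by the same process simply overwrites its entry, which models the repeated elevation that occurs when a transaction involves several reads.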

In our implementation we discriminate between local and remote users by giving higher priority
to processes receiving requests through UNIX sockets relative to those processes receiving
requests through network sockets. This is implemented by using two initial negative nice
values: a lower value for processes that use local UNIX sockets and a
higher value for processes that use network sockets to receive requests. All of these values
are presented in Table 1.

Although this method of implementation might have a higher overhead, as we see in
Section 7 it has very promising results for real world applications.

avenrun[0] <   -nice1 (UNIX sockets)   -nice2 (Network sockets)   Time Delay (ms)
1600
3000           -1                                                 200
5000           -2                      -1                         300
8000           -4                      -2                         400
12000          -6                      -3                         500
16000          -7                      -4                         600
> 16000        -15                     -5                         600

Table 1: RBPE script nice and time delay values. avenrun[0] is a kernel variable which
represents the system's 1 minute load.
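
As an illustration of how Table 1 is used, the lookup below maps a load value to the initial nice value and the delay. It is a sketch and not part of our script: the names RBPE_TABLE and rbpe_parameters are hypothetical, cells that are blank in the table are kept as None rather than filled in, and the behavior for a load of exactly 16000 (which the table does not specify) is an assumption.

```python
# (upper bound on avenrun[0], -nice1 for UNIX sockets,
#  -nice2 for network sockets, delay in ms); None = blank cell in Table 1.
RBPE_TABLE = [
    (1600, None, None, None),
    (3000, -1, None, 200),
    (5000, -2, -1, 300),
    (8000, -4, -2, 400),
    (12000, -6, -3, 500),
    (16000, -7, -4, 600),
]
ABOVE_16000 = (-15, -5, 600)


def rbpe_parameters(avenrun0, unix_socket):
    """Return (initial nice or None, delay in ms or None) for a load value."""
    for bound, nice_unix, nice_net, delay_ms in RBPE_TABLE:
        if avenrun0 < bound:
            return (nice_unix if unix_socket else nice_net), delay_ms
    # Assumption: a load of exactly 16000 is treated like the "> 16000" row.
    nice_unix, nice_net, delay_ms = ABOVE_16000
    return (nice_unix if unix_socket else nice_net), delay_ms


print(rbpe_parameters(4000, unix_socket=True))
print(rbpe_parameters(4000, unix_socket=False))
```

A caller would skip the elevation entirely whenever the returned nice value is None.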



7 Experiments 

To test our CFS/RBPE technique, we performed multiple tests. As we intend to show that
our scheduling paradigm does not focus only on interactive applications, we have performed server
based performance tests as well as regular interactivity/multimedia tests. The server based tests
include an Apache web server [1] performance test and Mysql data base server [1] performance
tests. We chose these two servers as they are very popular. In fact, many Linux based
servers use Apache, Mysql and PHP to support different weblog, wiki or multimedia hosting
services.



7.1 Hardware/Software Setup

All tests are performed on an IBM IntelliStation M Pro with an Intel P4 2.8GHz CPU and
1GB of RAM running Fedora 12 with kernel 2.6.32.

In order to simulate different background system loads we compile the Linux kernel and use
different -j values with the make command to initiate different numbers of parallel compilations.



7.2 Apache Web Server Response Test 

As a first test to measure server based application performance we measure the response
time of the Apache web server under different system loads. The web server hosts a directory
structure of files. We use a second machine and the wget command to download the complete
directory structure from the web server. We use the shell time command to measure wall clock
download times under different simulated load conditions. For each simulated load value
we repeat the experiment 3 times and compute the average download time for that system
load. As we see in Figure 1, both CFS and CFS/RBPE have the same response time for -j
values up to 2; after that point CFS/RBPE has consistently lower response times. When a -j
value of 30 is used, the Apache web server is almost 1.5 times slower under
CFS than under CFS/RBPE.

Figure 1: Download times in seconds from an Apache web server under different simulated
load values. 



7.3 SysBench Mysql Data Base Test 

As a second server based software, we test Mysql response times under different background
load conditions. We use SysBench [16] to evaluate and compare Mysql data base performance
under the Linux CFS [10] [12] and CFS/RBPE scheduling policies. SysBench is a multi-threaded
benchmark tool which can evaluate OS parameters important for a system running a
data base server under heavy load. We use the SysBench default parameters for its OLTP complex
configuration. During each experiment SysBench executes many read and write transactions on
the Mysql data base server, and then gives the average transaction time. Each experiment
runs for 120 to 250 seconds under different simulated load conditions. We change the experiment
times in order to have around 3000 transactions per experiment. The reason is that
when the system load increases, the total number of transactions in a fixed time interval decreases.
So we increase the experiment time to have an almost equal number of transactions in each
experiment. This gives us a better average transaction time in all experiments. Figure 2
shows the average transaction time in milliseconds under different simulated load conditions.
Load conditions were simulated by running a Linux kernel compilation job with different -j
options to the make command, as in the previous test in Section 7.2.



Figure 2: Average transaction times in milliseconds under different simulated
load conditions. The X values are the -j values passed to the make command to specify the number
of parallel compilations to initiate.

As we see in Figure 2, the average Mysql transaction time under CFS/RBPE first increases
as the -j value increases to 1 (meaning from no load to one parallel kernel compilation),
but then it decreases when the number of parallel compilations increases to two and three. This is because
RBPE does not interfere with the usual CFS scheduling when the system load is low. When
the system load increases, RBPE becomes involved and, as we see in Figure 2, it boosts Mysql
performance so that its average transaction time is almost constant near 40 milliseconds.
In contrast, under CFS scheduling, the Mysql average transaction time increases as the system load
increases. When we use a -j value of 12, the average transaction time reaches 85 milliseconds,
which is more than two times that of CFS/RBPE.

The results of the experiments in this section and the previous section indicate that
the scheduling paradigm proposed in this paper is not just a method to boost the response time of
interactive desktop applications. This mechanism can boost the performance of any
request/response based transaction in the system.

7.4 Interactivity/Multimedia Test 

In order to test and compare the performance of interactive/multimedia applications under
the CFS and CFS/RBPE schedulers, we use mplayer in benchmark mode. In this mode mplayer
prints out the number of dropped frames and the average frame rate after it finishes playing a
multimedia file. We use a short mpeg clip of size 352 x 288 pixels, which runs for about
150 seconds. The clip frame rate is 25 frames per second. Again we simulate the system
load with parallel compilation of the Linux kernel and use make -j with different values for -j to
control the number of parallel makes.

Figure 3: Comparison of the frame rate drop when mplayer is playing an mpeg movie
clip while the simulated background load increases.

For each -j value the experiment is repeated three times and the average number of dropped
frames is depicted in Figure 3. As we see in this graph, under CFS/RBPE the frame drop is
almost zero for all load values up to -j 12. Under CFS the number of frame drops increases
significantly after -j 4.

We also depict the average frame rate of mplayer for CFS and CFS/RBPE under different
simulated load levels. As we see in Figure 4, the mplayer frame rate drops to almost 8 frames
per second under CFS when 12 parallel compilations are running, while at the same load level
CFS with RBPE shows almost no frame rate reduction.

This experiment shows the effectiveness of CFS/RBPE on a typical multimedia or streaming
application. As Figures 3 and 4 indicate, the movie is basically not viewable when
the -j value reaches 5 on our system with CFS scheduling. In contrast, a viewer can still enjoy
watching a movie on the same system if CFS/RBPE scheduling is used, even if make -j with a
value of 12 is used for compiling a Linux kernel at the same time.

Figure 4: Frame rate change due to system load under the CFS scheduler and
CFS/RBPE.

8 Concluding Remarks 

In this work we introduce a new policy for CPU scheduling. This policy is based on tracking
requests sent by customers to different processes and their responses to the requests. We
assume that a computer system should allocate its resources such that the customers do not
experience excessive delays. We have defined a model which can be used to analyze and
compare different scheduling algorithms based on this assumption. This model considers
delays resulting from process dependencies. When a request arrives at the system, one or
more processes are responsible for executing the request and returning a response to the customer.
We detect the request and the processes which are responding to it by tracking
interprocess communications. We have a minimal implementation on top of the Linux CFS
scheduler [10] [12] as a proof of concept, which increases the priority of the processes involved
in a response to a customer request. Our experiments show that this mechanism is not only
effective for improving interactive/desktop application performance under heavy system
load, it is also effective for improving server applications under heavy background load.
Experiments with the Apache web server and the Mysql data base server in Sections 7.2 and 7.3
show a significant performance boost for these server applications under heavy background
load. A server background load may be the result of disk indexing, data base indexing, log
rotation, log analysis, etc.

One of our goals is to make the implementation simple, easily portable to different Linux
kernels and distributions, and easy to use for a novice user. In order to achieve this, we
use SystemTap [6]. SystemTap has made our implementation a simple script which can be
run on any SystemTap equipped distribution with a compatible kernel without the need to
recompile and install a new kernel. There has been debate and disagreement in the kernel
development community on whether to support SystemTap or not [7]. If support for
SystemTap is dropped by its main developers then, as far as we know, there is no alternative
that allows an implementation as fast and easy to use as the one in this work.

Some possible directions for future work are:

1. Integrating the implementation with the Linux kernel instead of using SystemTap.

2. Extending the proposed CPU scheduling paradigm to disk scheduling, in the sense that
higher priority processes also get higher priority disk access.

3. Studying the effects of adding other interprocess communication mechanisms, like pipes, to the
implementation.

4. Adding other mechanisms to detect the arrival of a request at the system.

5. Finding ways to give different weights to different requests from a specific customer.

6. Studying the use of the proposed mechanism in managing resources used by different
virtual machines on one real machine.